About Data Analysis Report

This RMarkdown report documents the data analysis for a project on forecasting daily bike rental demand using time series models in R. It covers data exploration, summary statistics and the building of time series models. The final report was completed on Sun Jun 16 19:39:48 2024.

Data Description:

This dataset contains the daily count of rental bike transactions in the Capital Bikeshare system for the years 2011 and 2012, together with the corresponding weather and seasonal information.

Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

Relevant Paper:

Fanaee-T, Hadi, and Gama, Joao, ‘Event labeling combining ensemble detectors and background knowledge’, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg

Setting up our environment and loading the required packages

We start by loading the libraries needed for the analysis, including those used for cleaning, transforming, visualizing and analyzing the data.

## Import required packages
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
# Load the data
bikeday <- read_csv("bike_day_rental.csv")
## Rows: 731 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (15): instant, season, yr, mnth, holiday, weekday, workingday, weathers...
## date  (1): dteday
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Describing and visualizing the data

We begin our analysis by inspecting and visualizing the data.

View(bikeday)

After taking a glimpse at the dataset, we can see that it records several factors, any of which could help explain why demand fluctuates over time.

Rather than model all of these factors with a simple or multiple regression, we opt for a univariate time series model that uses only the demand (the “cnt” column) and the date (the “dteday” column).

Let’s plot just these two variables, which the rest of the analysis is based on:

ggplot(bikeday, aes(x = as.Date(dteday), y = cnt)) +
  geom_line() +
  labs(title = "Daily Bike Rentals", x = "Date", y = "Count")

The graph shows demand rising from January to around July and then falling back towards December within each year. Demand therefore peaks around the middle of the year, in the second trimester (T2).
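One way to check this visual impression numerically is to average the daily counts by calendar month. The sketch below is only an illustration: the `demo` data frame and its values are synthetic stand-ins, not the real dataset. With the actual data, `bikeday$dteday` and `bikeday$cnt` would take their place.

```r
# Synthetic stand-in for the bike data (hypothetical values, NOT the real
# dataset): two years of daily counts with a mid-year peak plus noise.
dates <- seq(as.Date("2011-01-01"), as.Date("2012-12-31"), by = "day")
day_of_year <- as.numeric(format(dates, "%j"))
set.seed(1)
cnt <- round(4500 - 2500 * cos(2 * pi * (day_of_year - 15) / 365) +
               rnorm(length(dates), sd = 300))
demo <- data.frame(dteday = dates, cnt = cnt)

# Average daily count per calendar month, pooled across both years
monthly_mean <- tapply(demo$cnt, format(demo$dteday, "%m"), mean)
print(round(monthly_mean))

# Calendar month with the highest average demand
peak_month <- names(which.max(monthly_mean))
cat("Peak month:", peak_month, "\n")
```

Pooling both years per month is a deliberate simplification; grouping by year and month instead would show whether the peak shifts between 2011 and 2012.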

Creating an interactive time series plot for more specific analysis

p <- ggplot(bikeday, aes(x = as.Date(dteday), y = cnt)) +
  geom_line() +
  labs(title = "Daily Bike Rentals", x = "Date", y = "Count")

ggplotly(p)

With this interactive plot, built with the “plotly” library, we can hover the cursor over the line to inspect the date and count at any point of interest.

Smoothing our time series data

Next, we smooth the data with a 7-day rolling mean. This dampens day-to-day noise, which makes the underlying pattern easier to read before forecasting.

bikeday <- bikeday %>%
  mutate(cnt_smooth = zoo::rollmean(cnt, k = 7, fill = NA))

ggplot(bikeday, aes(x = as.Date(dteday))) +
  geom_line(aes(y = cnt), color = "blue") +
  geom_line(aes(y = cnt_smooth), color = "green") +
  labs(title = "Daily Bike Rentals (Smoothed)", x = "Date", y = "Count")
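To make the smoothing step concrete, here is a minimal base R sketch of a centered 7-point moving average on a short made-up vector. It assumes that `zoo::rollmean()` is used with its default centered alignment, as in the call above; `stats::filter()` is called with its namespace because dplyr masks `filter()`.

```r
# Short made-up series (hypothetical values for illustration only)
x <- c(10, 12, 14, 13, 15, 20, 22, 21, 19, 18)
k <- 7

# Centered moving average: each value is the mean of itself, the three
# observations before it and the three after it. Positions without a full
# window (the first and last three) come back as NA.
smooth <- as.numeric(stats::filter(x, rep(1 / k, k), sides = 2))
print(round(smooth, 2))

# The fourth smoothed value is the plain mean of the first seven observations
print(mean(x[1:7]))
```

The NA values at both ends are why the green smoothed line in the plot is slightly shorter than the blue raw series.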

Decomposing and assessing the stationarity of time series data

Decomposing the data isolates its underlying components (trend, seasonality and noise), which makes the series easier to understand, model and forecast accurately.

Assessing the stationarity of time series data ensures its statistical properties remain consistent over time, which is crucial for applying reliable forecasting models and interpreting data trends effectively.
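One possible way to assess stationarity numerically is a unit-root test. The sketch below uses the Phillips-Perron test from base R (`PP.test()` in the stats package) on a simulated random walk, chosen here only because its non-stationarity is known in advance. It is an illustrative assumption, not the procedure used in this report; with the real data, `bikeday_ts` would be the input.

```r
# Simulated random walk (hypothetical data): non-stationary by construction
set.seed(123)
random_walk <- cumsum(rnorm(500))

# Null hypothesis of the Phillips-Perron test: the series has a unit root.
# A small p-value is evidence against a unit root, i.e. for stationarity.
pp_level <- PP.test(random_walk)
pp_diff  <- PP.test(diff(random_walk))  # first difference of the series

cat("p-value, level series:      ", pp_level$p.value, "\n")
cat("p-value, differenced series:", pp_diff$p.value, "\n")
```

If the level series does not reject the unit-root hypothesis but the differenced series does, differencing is the usual remedy, and it is also what the "I" in ARIMA stands for.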

# Convert to time series object
bikeday_ts <- ts(bikeday$cnt, start = c(2011, 1), frequency = 365)

# Decompose the time series
bikeday_decomp <- decompose(bikeday_ts)

# Plot decomposition
plot(bikeday_decomp)
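As a reading aid for the decomposition plot, the following sketch shows on synthetic data that an additive decomposition (the default of `decompose()`) splits the series so that trend, seasonal and random components sum back to the observed values. The weekly pattern, trend and noise level are made up for illustration.

```r
# Synthetic series (hypothetical values): linear trend + weekly pattern + noise
set.seed(7)
n <- 7 * 20
weekly_pattern <- c(-300, -100, 0, 100, 200, 400, -300)  # sums to zero
trend <- seq(1000, 2000, length.out = n)
x <- ts(trend + rep(weekly_pattern, times = 20) + rnorm(n, sd = 50),
        frequency = 7)

# Additive decomposition: observed = trend + seasonal + random
parts <- decompose(x)
rebuilt <- parts$trend + parts$seasonal + parts$random

# The moving-average trend is undefined at both ends, so compare only
# the positions where all three components are available.
ok <- !is.na(rebuilt)
max_gap <- max(abs(rebuilt[ok] - x[ok]))
cat("Largest reconstruction error:", max_gap, "\n")
```

Note that `decompose()` needs at least two full periods. With `frequency = 365` and 731 daily observations, the bike series only just meets that requirement, so the seasonal component there is estimated from about two repetitions of the yearly cycle.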

Fitting and forecasting time series data using ARIMA models

R offers many tools for statistics. Here we use ARIMA (Autoregressive Integrated Moving Average) models, and we rely on the “forecast” package both to select the model with auto.arima() and to produce the forecast.
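To illustrate the idea behind automatic order selection by an information criterion, here is a minimal sketch that fits a handful of candidate ARIMA orders with base R's `arima()` and keeps the one with the lowest AIC. The simulated series and the candidate list are assumptions for illustration; `auto.arima()` uses its own, more elaborate search strategy.

```r
# Simulated AR(1) series (hypothetical data, not the bike rentals)
set.seed(42)
y <- arima.sim(model = list(ar = 0.7), n = 300)

# Small hand-picked grid of (p, d, q) orders to compare
candidates <- list(c(0, 0, 0), c(1, 0, 0), c(0, 0, 1), c(1, 0, 1), c(2, 0, 0))

# Fit each candidate and record its AIC (lower is better)
aics <- sapply(candidates, function(ord) AIC(arima(y, order = ord)))
names(aics) <- sapply(candidates, paste, collapse = ",")
print(round(aics, 2))

best <- names(which.min(aics))
cat("Order with the lowest AIC:", best, "\n")
```

AIC values are only comparable between models fitted to the same data with the same differencing, which is one reason to be careful when comparing them across differently specified fits.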

bikeday_model <- auto.arima(bikeday_ts, ic = "aic", trace = TRUE)
## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,0,2)(0,1,0)[365] with drift         : 533.6237
##  ARIMA(0,0,0)(0,1,0)[365] with drift         : 605.7292
##  ARIMA(1,0,0)(0,1,0)[365] with drift         : 541.9535
##  ARIMA(0,0,1)(0,1,0)[365] with drift         : 556.3631
##  ARIMA(0,0,0)(0,1,0)[365]                    : 1056.396
##  ARIMA(1,0,2)(0,1,0)[365] with drift         : 531.0975
##  ARIMA(0,0,2)(0,1,0)[365] with drift         : 545.1162
##  ARIMA(1,0,1)(0,1,0)[365] with drift         : 540.8333
##  ARIMA(1,0,3)(0,1,0)[365] with drift         : 545.6059
##  ARIMA(0,0,3)(0,1,0)[365] with drift         : 542.9662
##  ARIMA(2,0,1)(0,1,0)[365] with drift         : 531.8038
##  ARIMA(2,0,3)(0,1,0)[365] with drift         : Inf
##  ARIMA(1,0,2)(0,1,0)[365]                    : Inf
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(1,0,2)(0,1,0)[365] with drift         : 6273.517
## 
##  Best model: ARIMA(1,0,2)(0,1,0)[365] with drift
# Print model summary
summary(bikeday_model)
## Series: bikeday_ts 
## ARIMA(1,0,2)(0,1,0)[365] with drift 
## 
## Coefficients:
##          ar1      ma1      ma2   drift
##       0.9586  -0.6363  -0.1892  5.7093
## s.e.  0.0283   0.0583   0.0506  0.7566
## 
## sigma^2 = 1599566:  log likelihood = -3131.76
## AIC=6273.52   AICc=6273.68   BIC=6293.03
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 5.357075 890.0137 457.0405 -44.28372 51.73145 0.1967752 0.01047274
bikeday_forecast <- forecast(bikeday_model, h = 365)

# Plot the forecast
plot(bikeday_forecast)

In the plot, the forecast of bike demand for the next year (365 days ahead) is drawn in blue. The shaded bands around it show the prediction intervals, that is, the range of variation expected around the forecast values. The forecast can serve as a starting point for planning inventory, availability and more.
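The error measures printed by `summary()` above are computed on the training set. A common complementary check is to hold out the last part of the series, forecast it, and compare the forecast with what actually happened. The sketch below does this on simulated data with base R's `arima()` and `predict()`, used here only to keep the example self-contained; the series, the split point and the AR(1) order are all assumptions for illustration.

```r
# Simulated series (hypothetical data): AR(1) fluctuations around a level of 100
set.seed(2024)
y <- 100 + arima.sim(model = list(ar = 0.8), n = 230)

# Hold out the last 30 observations as a test set
train <- y[1:200]
test  <- y[201:230]

# Fit on the training part only and forecast the held-out horizon
fit <- arima(train, order = c(1, 0, 0))
fc  <- predict(fit, n.ahead = length(test))

# Out-of-sample error measures
errors <- test - as.numeric(fc$pred)
rmse <- sqrt(mean(errors^2))
mae  <- mean(abs(errors))

# Approximate 95% prediction interval from the forecast standard errors
lower <- as.numeric(fc$pred - 1.96 * fc$se)
upper <- as.numeric(fc$pred + 1.96 * fc$se)
coverage <- mean(test >= lower & test <= upper)

cat("RMSE:", round(rmse, 2), " MAE:", round(mae, 2),
    " share of test points inside the interval:", round(coverage, 2), "\n")
```

For the bike data, the same idea would mean fitting on an initial portion of `bikeday_ts` and comparing the forecast with the remaining days, although holding out data leaves less than two full years for estimating a yearly seasonal pattern.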

Findings and Conclusions

Findings:

Conclusions:

Additional Resources